NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

APOLLO: SGD-like Memory AdamW-level Performance

Zhu, Hanqing; Zhang, Zhenyu; Cong, Wenyan; Liu, Xi; Park, Sem; Chandra, Vikas; Long, Bo; Pan, David Z; Wang, Zhangyang; Lee, Jinwon (May 2025, Conference on Machine Learning and Systems (MLSys))

Free, publicly-accessible full text available May 10, 2026
BEAST X for Bayesian phylogenetic, phylogeographic and phylodynamic inference

https://doi.org/10.1038/s41592-025-02751-x

Baele, Guy; Ji, Xiang; Hassler, Gabriel W; McCrone, John T; Shao, Yucai; Zhang, Zhenyu; Holbrook, Andrew J; Lemey, Philippe; Drummond, Alexei J; Rambaut, Andrew; et al (August 2025, Nature Methods)

Abstract Here we present the open-source and cross-platform BEAST X software that combines molecular phylogenetic reconstruction with complex trait evolution, divergence-time dating and coalescent demographics in an efficient statistical inference engine.BEAST Xsignificantly advances the flexibility and scalability of evolutionary models supported. Novel clock and substitution models leverage a large variety of evolutionary processes; discrete, continuous and mixed traits with missingness and measurement errors; and fast, gradient-informed integration techniques that rapidly traverse high-dimensional parameter spaces.
more » « less
Free, publicly-accessible full text available August 1, 2026
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Zhao, Jiawei; Zhang, Zhenyu; Chen, Beidi; Wang, Zhangyang; Anandkumar, Anima; Tian, Yuandong (July 2024, International Conference on Machine Learning (ICML))

Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer, reducing trainable parameters and optimizer states. However, such approaches typically underperform training with full-rank weights in both pre-training and fine-tuning stages since they limit the parameter search to a low-rank subspace and alter the training dynamics, and further, may require full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies.
more » « less
Full Text Available
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Zhao, Jiawei; Zhang, Zhenyu; Chen, Beidi; Wang, Zhangyang; Anandkumar, Anima; Tian, Yuandong (July 2024, International Conference on Machine Learning (ICML))

Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer, reducing trainable parameters and optimizer states. However, such approaches typically underperform training with full-rank weights in both pre-training and fine-tuning stages since they limit the parameter search to a low-rank subspace and alter the training dynamics, and further, may require full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies.
more » « less
Full Text Available
Sparsity-Guided Holistic Explanation for LLMs with Interpretable Inference-Time Intervention

https://doi.org/10.1609/aaai.v38i19.30160

Tan, Zhen; Chen, Tianlong; Zhang, Zhenyu; Liu, Huan (March 2024, Proceedings of the AAAI Conference on Artificial Intelligence)

Large Language Models (LLMs) have achieved unprecedented breakthroughs in various natural language processing domains. However, the enigmatic ``black-box'' nature of LLMs remains a significant challenge for interpretability, hampering transparent and accountable applications. While past approaches, such as attention visualization, pivotal subnetwork extraction, and concept-based analyses, offer some insight, they often focus on either local or global explanations within a single dimension, occasionally falling short in providing comprehensive clarity. In response, we propose a novel methodology anchored in sparsity-guided techniques, aiming to provide a holistic interpretation of LLMs. Our framework, termed SparseCBM, innovatively integrates sparsity to elucidate three intertwined layers of interpretation: input, subnetwork, and concept levels. In addition, the newly introduced dimension of interpretable inference-time intervention facilitates dynamic adjustments to the model during deployment. Through rigorous empirical evaluations on real-world datasets, we demonstrate that SparseCBM delivers a profound understanding of LLM behaviors, setting it apart in both interpreting and ameliorating model inaccuracies. Codes are provided in supplements.
more » « less
Full Text Available
QuantumSEA: In-Time Sparse Exploration for Noise Adaptive Quantum Circuits

https://doi.org/10.1109/QCE57702.2023.00015

Chen, Tianlong; Zhang, Zhenyu; Wang, Hanrui; Gu, Jiaqi; Li, Zirui; Pan, David Z.; Chong, Frederic T.; Han, Song; Wang, Zhangyang (September 2023, 2023 IEEE International Conference on Quantum Computing and Engineering (QCE))

Full Text Available
Accelerating Bayesian inference of dependency between mixed-type biological traits

https://doi.org/10.1371/journal.pcbi.1011419

Zhang, Zhenyu; Nishimura, Akihiko; Trovão, Nídia S.; Cherry, Joshua L.; Holbrook, Andrew J.; Ji, Xiang; Lemey, Philippe; Suchard, Marc A. (August 2023, PLOS Computational Biology)
Chekouo, Thierry (Ed.)
Inferring dependencies between mixed-type biological traits while accounting for evolutionary relationships between specimens is of great scientific interest yet remains infeasible when trait and specimen counts grow large. The state-of-the-art approach uses a phylogenetic multivariate probit model to accommodate binary and continuous traits via a latent variable framework, and utilizes an efficient bouncy particle sampler (BPS) to tackle the computational bottleneck—integrating many latent variables from a high-dimensional truncated normal distribution. This approach breaks down as the number of specimens grows and fails to reliably characterize conditional dependencies between traits. Here, we propose an inference pipeline for phylogenetic probit models that greatly outperforms BPS. The novelty lies in 1) a combination of the recent Zigzag Hamiltonian Monte Carlo (Zigzag-HMC) with linear-time gradient evaluations and 2) a joint sampling scheme for highly correlated latent variables and correlation matrix elements. In an application exploring HIV-1 evolution from 535 viruses, the inference requires joint sampling from an 11,235-dimensional truncated normal and a 24-dimensional covariance matrix. Our method yields a 5-fold speedup compared to BPS and makes it possible to learn partial correlations between candidate viral mutations and virulence. Computational speedup now enables us to tackle even larger problems: we study the evolution of influenza H1N1 glycosylations on around 900 viruses. For broader applicability, we extend the phylogenetic probit model to incorporate categorical traits, and demonstrate its use to studyAquilegiaflower and pollinator co-evolution.
more » « less
Full Text Available
Data-Efficient Double-Win Lottery Tickets from Robust Pre-training

Chen, Tianlong; Zhang, Zhenyu; Liu, Sijia; Zhang, Yang; Chang, Shiyu; Wang, Zhangyang (July 2022, International Conference on Machine Learning)

Full Text Available
Data-Efficient Double-Win Lottery Tickets from Robust Pre-training

Chen, Tianlong; Zhang, Zhenyu; Liu, Sijia; Zhang, Yang; Chang, Shiyu; Wang, Zhangyang (July 2022, International Conference on Machine Learning (ICML))

Pre-training serves as a broadly adopted starting point for transfer learning on various downstream tasks. Recent investigations of lottery tickets hypothesis (LTH) demonstrate such enormous pre-trained models can be replaced by extremely sparse subnetworks (a.k.a. matching subnetworks) without sacrificing transferability. However, practical security-crucial applications usually pose more challenging requirements beyond standard transfer, which also demand these subnetworks to overcome adversarial vulnerability. In this paper, we formulate a more rigorous concept, Double-Win Lottery Tickets, in which a located subnetwork from a pre-trained model can be independently transferred on diverse downstream tasks, to reach BOTH the same standard and robust generalization, under BOTH standard and adversarial training regimes, as the full pre-trained model can do. We comprehensively examine various pre-training mechanisms and find that robust pre-training tends to craft sparser double-win lottery tickets with superior performance over the standard counterparts. For example, on downstream CIFAR-10/100 datasets, we identify double-win matching subnetworks with the standard, fast adversarial, and adversarial pre-training from ImageNet, at 89.26%/73.79%, 89.26%/79.03%, and 91.41%/83.22% sparsity, respectively. Furthermore, we observe the obtained double-win lottery tickets can be more data-efficient to transfer, under practical data-limited (e.g., 1% and 10%) downstream schemes. Our results show that the benefits from robust pre-training are amplified by the lottery ticket scheme, as well as the data-limited transfer setting.
more » « less
Full Text Available
Improving Process Discovery Results by Filtering Out Outliers from Event Logs with Hidden Markov Models

https://doi.org/10.1109/CBI52690.2021.00028

Zhang, Zhenyu; Hildebrant, Ryan; Asgarinejad, Fatemeh; Venkatasubramanian, Nalini; Ren, Shangping (September 2021, 2021 IEEE 23rd Conference on Business Informatics (CBI))

Process Mining is a technique for extracting process models from event logs. Event logs contain abundant explicit information related to events, such as the timestamp and the actions that trigger the event. Much of the existing process mining research has focused on discovering the process models behind these event logs. However, Process Mining relies on the assumption that these event logs contain accurate representations of an ideal set of processes. These ideal sets of processes imply that the information contained within the log represents what is really happening in a given environment. However, many of these event logs might contain noisy, infrequent, missing, or false process information that is generally classified as outliers. Extending beyond process discovery, there are many research efforts towards cleaning the event logs to deal with these outliers. In this paper, we present an approach that uses hidden Markov models to filter out outliers from event logs prior to applying any process discovery algorithms. Our proposed filtering approach can detect outlier behavior, and consequently, help process discovery algorithms return models that better reflect the real processes within an organization. Furthermore, we show that this filtering method outperforms two commonly used filtering approaches, namely the Matrix Filter approach and the Anomaly Free Automation approach for both artificial event logs and real-life event logs.
more » « less
Full Text Available

« Prev Next »

Search for: All records